Abstract

The mismatch between the source and target noisy corpora severely
hinders the practical use of machine-learning-based voice activity
detection (VAD). In this paper, we address this problem from a
transfer learning perspective. Transfer learning tries to find a
common learning machine or a common feature subspace that is shared
by both the source corpus and the target corpus. The denoising deep
neural network is used as the learning machine. Three transfer
techniques, which aim to learn common feature representations, are
analyzed. Experimental results demonstrate the effectiveness of the
transfer learning schemes on the mismatch problem.
Index Terms: deep learning, domain adaptation, feature learning,
transfer learning, voice activity detection.

1. Introduction 

Voice activity detectors (VADs) aim to separate speech from
background noise. They are important front ends of modern speech
recognition systems [1-3] and speech signal processing systems [4].
Recently, machine-learning-based VADs [5-9] have received much
attention because they not only integrate naturally into speech
recognition systems but also fuse the advantages of multiple
features [10-15] much better than traditional VADs. However,
machine-learning-based VADs are still far from practical use. One
significant problem is that we cannot be sure whether a VAD model
trained on a given source corpus remains effective on a target
corpus whose distribution might differ from that of the source
corpus.

In this paper, we address the aforementioned problem with transfer
learning. Generally, transfer learning tries to make a model trained
on one or multiple source tasks generalize well to different but
related target tasks, so that the performance gap between the source
tasks and the target tasks is reduced. See [16] for an excellent
survey on transfer learning. Depending on whether the source data or
the target data is manually labeled, transfer learning techniques
can be categorized into four groups [16]. This paper focuses on
domain adaptation techniques, where the source data is manually
labeled and the target data is unlabeled, which is a practical
scenario for machine-learning-based VAD. Depending on what is
transferred, transfer learning methods can be categorized into three
groups: instance transfer, feature transfer, and parameter transfer
(i.e. model transfer) [16, 17]. This paper focuses on feature
transfer techniques. Generally, feature transfer tries to learn a
low-dimensional feature representation that is shared by both the
source data and the target data, so that a classifier trained on the
source data in the learned subspace generalizes well to the target
data. The main contributions of this paper are summarized as
follows:

1. Addressing the mismatch problem of machine-learning-based VAD.
We have conducted an extensive experiment on the mismatch problem
from the domain adaptation perspective. The recently proposed
denoising deep neural network (DDNN) [18] is used as the learning
machine. Empirical results show that the transfer learning schemes
outperform several state-of-the-art VADs when the source and target
corpora are relatively similar. The results also demonstrate the
promise of machine-learning-based VADs for practical use.

2. A useful empirical comparison of three feature-based domain
adaptation schemes. We have proposed three domain adaptation schemes
for the DDNN-based VAD. Empirical results show that the deep neural
network can be pre-trained in an unsupervised manner either with the
target data only or with both the source data and the target data
together, but the pre-training data must be the same for all layers,
which confirms the strength of the pre-training scheme proposed by
Hinton et al. [19].

We note that the main purpose of this paper is to discuss the
effectiveness of transfer learning for the mismatch problem between
the source data and the target data, not to propose a specific VAD
algorithm. Much effort is still needed to make machine-learning-based
VAD work well in practice. For example, the DDNN-based VAD needs the
clean speech signals of its noisy speech corpus in the unsupervised
pre-training stage, which is an idealized situation.

The remainder of the paper is organized as follows. In Sec- 
tion 2, we first review the recently proposed DDNN and then 
present three feature-based domain adaptation schemes for the 
DDNN-based VAD. In Section 3, we present the related work. 
In Section 4, we conduct an extensive experimental comparison. 
In Section 5, we conclude this paper with some future work. 



Scheme 1 A successful scheme. 

1: Pre-train all layers of DDNN with only the target corpus
$\mathcal{X}^{(t)}$.

2: Fine-tune the pre-trained DDNN with only the labeled
source corpus, i.e. $\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$.



Scheme 2 A successful scheme. 

1: Take the source corpus $\mathcal{X}^{(s)}$ and the target corpus
$\mathcal{X}^{(t)}$ together as a large corpus.

2: Pre-train all layers of DDNN with the large corpus.

3: Fine-tune the pre-trained DDNN with only the labeled
source corpus, i.e. $\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$.



2. Domain Adaptation for VAD 

In this section, we first review the denoising deep neural net- 
work (DDNN), and then propose three feature-based domain 
adaptation schemes for the DDNN-based VAD. 

2.1. Review: Denoising Deep Neural Network for VAD 

DDNN [18] is a deep neural network motivated by the stacked
denoising autoencoder [20, 21]. Compared to the
deep-belief-network-based VAD [22], it succeeds in making its deep
layers perform better than its shallower layers. The key idea of
DDNN is to first minimize the reconstruction cross-entropy loss
between the noisy speech signal and its corresponding clean speech
signal in an unsupervised greedy layer-wise pre-training manner, and
then fine-tune the entire deep neural network by minimizing the
cross-entropy loss between the noisy speech signal and its manual
labels for the minimum classification error. One special point of
DDNN is that, in the pre-training phase, it needs to train an
accompanying deep neural network, i.e. a deep network that tries to
reconstruct the clean speech signal from the clean speech signal
itself. This network mainly supplies the noisy speech with its
optimization objective in each layer.

From the above, we can see that one weakness of DDNN is that the
noisy speech signal needs its corresponding clean speech signal in
the pre-training phase. Because this paper focuses on the
effectiveness of transfer learning, this weakness does not undermine
the main contributions of the paper.
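
As a rough illustration of the two objectives described above, the following sketch computes the reconstruction cross-entropy of a first pre-training layer against a clean target, and the classification cross-entropy used in fine-tuning. The function names, the single tied-weight sigmoid layer, and the random toy data are assumptions made for illustration; this is not the implementation of [18].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(target, pred, eps=1e-12):
    """Mean element-wise binary cross-entropy; both arrays hold values in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def reconstruction_loss(noisy, clean, w, b_hidden, b_visible):
    """Pre-training objective (assumed form): encode the noisy features,
    decode with tied weights, and compare with the clean features."""
    hidden = sigmoid(noisy @ w + b_hidden)
    recon = sigmoid(hidden @ w.T + b_visible)
    return cross_entropy(clean, recon)

def classification_loss(labels, speech_prob):
    """Fine-tuning objective: cross-entropy between manual labels and VAD output."""
    return cross_entropy(labels, speech_prob)

# Toy data (hypothetical): 8 frames with 5 features, all scaled to [0, 1].
rng = np.random.default_rng(0)
clean = rng.random((8, 5))
noisy = np.clip(clean + 0.1 * rng.standard_normal((8, 5)), 0.0, 1.0)
w = 0.1 * rng.standard_normal((5, 3))

pretrain_loss = reconstruction_loss(noisy, clean, w, np.zeros(3), np.zeros(5))
finetune_loss = classification_loss(np.array([1.0, 0.0]), np.array([0.9, 0.2]))
```

For deeper layers, the clean target would be replaced by the corresponding hidden representation of the accompanying clean-speech network, which this sketch does not model.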

2.2. Preliminary of the Feature-Based Domain Adaptation 

Suppose we have a labeled source corpus
$\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$ and an unlabeled target
corpus $\mathcal{X}^{(t)}$, where $\mathcal{X}$ denotes the acoustic
feature corpus and $\mathcal{Y}$ denotes the set of the manual
labels. The corpora $\mathcal{X}^{(s)}$ and $\mathcal{X}^{(t)}$ might
be sampled from different noise scenarios. The feature-based domain
adaptation scheme aims to find a mapping function $\phi(\cdot)$ such
that the distribution difference between $\phi(\mathcal{X}^{(s)})$
and $\phi(\mathcal{X}^{(t)})$ is minimized. Therefore, if we minimize
the classification error on the source corpus with the new feature
representation, i.e. $\phi(\mathcal{X}^{(s)}) \times \mathcal{Y}^{(s)}$,
we can also expect to minimize the classification error on the
target corpus.

2.3. Domain Adaptation Via Deep Feature Extraction 

For the DDNN-based VAD, a number of training schemes for
$\phi(\cdot)$ can be developed. The core idea is to first pre-train
DDNN in different unsupervised ways and then fine-tune DDNN with the
labeled source corpus, i.e.
$\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$, for the minimum
classification error.



Scheme 3 A failed scheme. 

Input: The desired depth of DDNN, denoted as $L$ (i.e. the number
of hidden layers).

1: Source DDNN pre-training: Pre-train the source DDNN with a depth
of $L-1$ with only $\mathcal{X}^{(s)}$. The pre-trained source DDNN
is denoted as $\{W_l^{(s)}\}_{l=1}^{L-1}$. /* Note: this model needs
to be trained only once and can be reused for different target
corpora. */

2: Target DDNN pre-training: Pre-train the target DDNN with a depth
of $L-1$ with only $\mathcal{X}^{(t)}$. The pre-trained target DDNN
is denoted as $\{W_l^{(t)}\}_{l=1}^{L-1}$.

3: Hybrid pre-training of the top layer: Group the output features
of the source DDNN and the target DDNN together into a large set,
and pre-train the $L$-th layer of DDNN with the large set. The
pre-trained model is denoted as $W_L$.

4: if Scheme 3$^{(t)}$ then

5: Fine-tune the pre-trained $\{\{W_l^{(t)}\}_{l=1}^{L-1}, W_L\}$
with only the labeled source corpus, i.e.
$\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$.

6: else if Scheme 3$^{(s)}$ then

7: Fine-tune the pre-trained $\{\{W_l^{(s)}\}_{l=1}^{L-1}, W_L\}$
with only the labeled source corpus, i.e.
$\mathcal{X}^{(s)} \times \mathcal{Y}^{(s)}$.

8: end if

9: Output the fine-tuned network as the learned DDNN.



In this paper, we present three unsupervised pre-training 
schemes, which are described in Schemes 1, 2, and 3 respec- 
tively. The effectiveness and efficiency of the three schemes are 
analyzed qualitatively as follows: 

Scheme 1 uses only the unlabeled target data $\mathcal{X}^{(t)}$ to
initialize DDNN. Because it uses only $\mathcal{X}^{(t)}$ for
pre-training, it is expected to be a relatively poor but
computationally efficient initialization scheme. Moreover, when
$\mathcal{X}^{(t)}$ is small, the initial point of DDNN might be
biased and still suffer from overfitting; hence, the network might
not be trained well.

Scheme 2 uses both $\mathcal{X}^{(s)}$ and $\mathcal{X}^{(t)}$ for
initializing DDNN, which can learn a good feature representation
shared by $\mathcal{X}^{(s)}$ and $\mathcal{X}^{(t)}$. In particular,
when $\mathcal{X}^{(t)}$ is scarce, $\mathcal{X}^{(s)}$ can
sufficiently supplement $\mathcal{X}^{(t)}$. Hence, the network is
expected to perform reasonably well on the target test data.
However, whenever we meet a new target task, we have to bear a heavy
computational load by training on $\mathcal{X}^{(s)}$ and
$\mathcal{X}^{(t)}$ jointly in the pre-training phase.

Scheme 3 is designed to be a compromise between Scheme 1 and
Scheme 2. Specifically, because we introduce the supplementary
effect of $\mathcal{X}^{(s)}$ only in the highest layer of DDNN,
which is a layer that directly influences the performance of DDNN,
we might not only transfer the source knowledge to the target domain
but also save a lot of training time, since 1) the most
computationally expensive part of DDNN is the pre-training of the
source DDNN, which can be trained once and for all; 2) the top layer
of DDNN usually has far fewer hidden units (i.e. much less training
time) than the bottom layers. Scheme 3 contains two sub-schemes,
which are denoted as Scheme 3$^{(t)}$ and Scheme 3$^{(s)}$
respectively.

Before the experimental section, we emphasize that 1) Schemes 1 and
2 are successful because the pre-training data is the same for all
layers, and 2) Scheme 3 fails to provide a good initial point for
DDNN because the pre-training data is not consistent across layers.
We share the failed scheme mainly because it represents a compromise
between the computationally light Scheme 1 and the computationally
heavy Scheme 2, and a successful compromise scheme that is as
effective as Scheme 2 and as efficient as Scheme 1 might be found in
the future.

Note that we can use multiple source corpora and multiple 
target corpora together to train the model freely. But in this pa- 
per, we only discuss the empirical performance with one source 
corpus and one target corpus, leaving the multiple source do- 
main adaptation problem to a future discussion. 
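
The data flow of the three pre-training schemes can be sketched as follows. This is a minimal illustration under stated assumptions: a plain tied-weight sigmoid autoencoder trained by batch gradient descent stands in for the DDNN layer (the clean-speech targets of [18] are not modeled), the supervised fine-tuning on the labeled source corpus that follows every scheme is omitted, and all function names, sizes, and toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(x, n_hidden, epochs=50, lr=0.5, seed=0):
    """Fit one tied-weight sigmoid autoencoder to x (values in [0, 1])."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.standard_normal((x.shape[1], n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(x.shape[1])
    for _ in range(epochs):
        h = sigmoid(x @ w + b_h)
        r = sigmoid(h @ w.T + b_v)
        d_r = (r - x) / len(x)             # cross-entropy gradient at decoder input
        d_h = (d_r @ w) * h * (1.0 - h)    # back-propagated to encoder input
        w -= lr * (x.T @ d_h + d_r.T @ h)  # tied weights: encoder + decoder terms
        b_h -= lr * d_h.sum(axis=0)
        b_v -= lr * d_r.sum(axis=0)
    return w, b_h

def encode(x, layers):
    """Propagate x through the already pre-trained layers."""
    for w, b_h in layers:
        x = sigmoid(x @ w + b_h)
    return x

def pretrain_stack(x, hidden_sizes):
    """Greedy layer-wise pre-training: each layer sees the previous layer's output."""
    layers = []
    for n_hidden in hidden_sizes:
        layers.append(pretrain_layer(encode(x, layers), n_hidden))
    return layers

def scheme1(x_s, x_t, sizes):
    """Scheme 1: every layer is pre-trained on the target corpus only (x_s unused)."""
    return pretrain_stack(x_t, sizes)

def scheme2(x_s, x_t, sizes):
    """Scheme 2: every layer is pre-trained on source and target pooled together."""
    return pretrain_stack(np.vstack([x_s, x_t]), sizes)

def scheme3(x_s, x_t, sizes, variant="t"):
    """Scheme 3: separate lower stacks, hybrid pre-training of the top layer only."""
    lower_s = pretrain_stack(x_s, sizes[:-1])
    lower_t = pretrain_stack(x_t, sizes[:-1])
    hybrid = np.vstack([encode(x_s, lower_s), encode(x_t, lower_t)])
    top = pretrain_layer(hybrid, sizes[-1])
    return (lower_t if variant == "t" else lower_s) + [top]

# Toy corpora (hypothetical): a large source set and a small target set.
rng = np.random.default_rng(1)
x_source, x_target = rng.random((200, 20)), rng.random((30, 20))
sizes = [8, 4, 4]
nets = {name: fn(x_source, x_target, sizes)
        for name, fn in [("S1", scheme1), ("S2", scheme2), ("S3", scheme3)]}
```

In Schemes 1 and 2 every layer is pre-trained on data that passed through the same stack, whereas in Scheme 3 the top layer sees features produced by two different lower stacks; this is the inconsistency across layers discussed above.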

3. Related Work 

With respect to transfer learning and deep learning, there has been
some work similar to the proposed schemes. For example, in [23],
Glorot et al. adopted a domain adaptation scheme that is exactly the
same as Scheme 2 of this paper for the sentiment classification
problem. In [24], Collobert and Weston proposed a joint training
scheme for the multitask learning problem of natural language
processing, whose key idea is similar to that of Scheme 3. The
architecture of [24] has also been successfully applied to machine
translation [25]. However, Scheme 3 is different from [24, 25] in
that Scheme 3 pre-trains the top hidden layer of DDNN with the set
$\{\mathcal{X}^{(s)}, \mathcal{X}^{(t)}\}$, while the architecture of
[24, 25] tries to learn a subspace of word mappings in the top
hidden layer under the strong constraint that each word in the
lookup table of $\mathcal{X}^{(s)}$ should have a matching word in
that of $\mathcal{X}^{(t)}$.

With respect to the VAD study, the distribution difference between
different noise scenarios has been considered in traditional VADs.
For example, in [26], Chang et al. used different statistical models
for modeling the speech and noise distributions in different noise
scenarios. Another topic related to domain adaptation is online
learning [27], in which the model parameters are updated according
to the historical domain information of the speech signals.
Traditional statistical-model-based VADs [26] can also be regarded
as unsupervised online learning methods. However, to our knowledge,
how to combine multiple features effectively is still an open
problem in online learning methods. On the other hand, although the
domain-adaptation-based VAD works in batch mode, it can combine
multiple features effectively and yield a high accuracy without
requiring heavy manual labeling.

4. Experiments 

Seven noisy test corpora of AURORA2 [28] are used for performance
analysis. The signal-to-noise ratio of the audio signals is set to
5 dB. Each test corpus of AURORA2 contains 1001 utterances, which
are split randomly into three groups for training, development, and
test respectively. Each training set and each development set
consists of 300 utterances. Each test set consists of 401
utterances.
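
The random split described above can be sketched as follows. The function name, the fixed seed, and the use of integer utterance indices are assumptions for illustration; the paper does not specify how the random split was drawn.

```python
import random

def split_utterances(utterance_ids, n_train=300, n_dev=300, seed=0):
    """Shuffle the utterances and cut them into training, development, and test sets."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]   # the remaining utterances
    return train, dev, test

# One AURORA2 test corpus has 1001 utterances (indices are placeholders).
train_ids, dev_ids, test_ids = split_utterances(range(1001))
```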

The sampling rate is 8 kHz. We set the frame length to 25 ms with a
frame shift of 10 ms. We extract 10 acoustic features from each
observation. The detailed information of the features is listed in
Table 1. All features are normalized into the range [0, 1] in each
dimension.
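
To make the framing and normalization concrete, the sketch below cuts an 8 kHz signal into 25 ms frames with a 10 ms shift and rescales each feature dimension to [0, 1]. The helper names, the min-max form of the normalization, and the two toy features are assumptions; the real features are those of Table 1.

```python
import numpy as np

SAMPLE_RATE = 8000                        # Hz
FRAME_LEN = SAMPLE_RATE * 25 // 1000      # 25 ms -> 200 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000    # 10 ms -> 80 samples

def frame_signal(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Return overlapping frames; assumes len(signal) >= frame_len."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return signal[idx]

def minmax_normalize(features, eps=1e-12):
    """Scale every feature dimension (column) into [0, 1]."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    return (features - lo) / np.maximum(hi - lo, eps)

# One second of synthetic noise stands in for a speech signal.
rng = np.random.default_rng(0)
signal = rng.standard_normal(SAMPLE_RATE)
frames = frame_signal(signal)

# Two toy per-frame features (hypothetical), normalized per dimension.
toy_features = np.column_stack([frames.mean(axis=1), frames.std(axis=1)])
normalized = minmax_normalize(toy_features)
```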

To simulate the real-world domain adaptation task, we take the
training sets of the Street and Subway noise scenarios as two source
corpora. For each source corpus, we form 6 domain adaptation tasks
by randomly extracting a 30-second audio segment from the training
set of each noise type of AURORA2 except that of the source corpus.
For each domain adaptation task, the development set of the source
corpus is used for model selection. We run each domain adaptation
task 5 times and report the average accuracies.

Table 1: Features and their attributes. The subscript of each
feature is the window length of the feature [29].

ID  Feature   Dimension    ID  Feature     Dimension
1   Pitch     1            7   MFCC_16     20
2   DFT       16           8   LPC         12
3   DFT_8     16           9   RASTA-PLP   17
4   DFT_16    16           10  AMS         135
5   MFCC      20               Total       273
6   MFCC_8    20

The parameters are set as follows. Up to three hidden layers are
adopted, with the numbers of hidden units set to [54, 7, 7]
respectively. The learning rate of the unsupervised pre-training is
set to 0.004. The maximum number of epochs of the unsupervised
pre-training is set to 200. The learning rate of the supervised
fine-tuning is set to 0.005. The maximum number of epochs of the
supervised fine-tuning is set to 130. Batch-mode training is
adopted, and each batch contains 512 observations. Note that the
parameters are selected empirically as a compromise between training
time and accuracy. The accuracy might be further improved by tuning
the parameters.
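
For reference, the settings above can be collected in a small configuration object. The class and field names are hypothetical, and the batch iterator only illustrates how 512-observation batches would partition a training set; neither is taken from the original implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    hidden_units: tuple = (54, 7, 7)   # up to three hidden layers
    pretrain_lr: float = 0.004
    pretrain_max_epochs: int = 200
    finetune_lr: float = 0.005
    finetune_max_epochs: int = 130
    batch_size: int = 512

def iter_batches(n_observations, batch_size):
    """Yield (start, stop) index pairs that cover all observations."""
    for start in range(0, n_observations, batch_size):
        yield start, min(start + batch_size, n_observations)

config = TrainingConfig()
# Example with a hypothetical set of 1300 observations.
batches = list(iter_batches(1300, config.batch_size))
```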

To evaluate the effectiveness of the proposed domain adaptation
schemes, we give the empirical lower bound and upper bound that the
schemes might achieve. The lower bound, denoted as "LB", is obtained
by training DDNN with only the source corpus and testing it on
various target environments. If the performance of a proposed domain
adaptation scheme is worse than LB, the scheme fails. The upper
bound, denoted as "UB", is obtained by training DDNN with the
training set of the target corpus and testing it on the test set of
the same target environment. If the performance of a proposed domain
adaptation scheme is better than UB, the scheme is exceptionally
successful. We also compare with the G.729B VAD [30], the ETSI
advanced front end via Wiener filter [31], the ETSI advanced front
end via frame dropping [31], Sohn VAD [32], Ramirez05 VAD [29],
Ramirez07 VAD [33], Yu VAD [5], Shin VAD [34], and Ying VAD [27].
The experimental settings are exactly the same as in [22].

4.1. Experimental Results 

First, we give the Hinton diagram of the feature distributions in
different noise scenarios in Fig. 1. From the figure, we can see
that most feature distributions are relatively similar to each other
except for the Subway noise scenario, which suggests that the
transfer learning schemes might be useful.
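
As an illustration of the similarity measure behind Fig. 1, the sketch below computes $\exp(-\|c^{(s)} - c^{(t)}\|^2 / 2)$ between the feature centroids of every pair of corpora. The function names and the random toy corpora are assumptions for illustration only; the actual features are those of Table 1.

```python
import numpy as np

def centroid_similarity(features_a, features_b):
    """exp(-||c_a - c_b||^2 / 2), with c the mean feature vector of a corpus."""
    c_a = features_a.mean(axis=0)
    c_b = features_b.mean(axis=0)
    return float(np.exp(-np.sum((c_a - c_b) ** 2) / 2.0))

def similarity_matrix(corpora):
    """Pairwise similarities; entry (i, j) would be one cell of the Hinton diagram."""
    n = len(corpora)
    sim = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = centroid_similarity(corpora[i], corpora[j])
    return sim

# Three toy corpora (hypothetical): the third is shifted far from the others.
rng = np.random.default_rng(0)
corpora = [rng.random((50, 4)) + shift for shift in (0.0, 0.1, 2.0)]
sim = similarity_matrix(corpora)
```

A corpus whose centroid lies far from the others, like the third toy corpus here, produces small off-diagonal entries, which corresponds to the small cells of the Subway scenario in the diagram.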

Table 2 lists the transfer accuracies with the Street noise as the
source corpus. From the table, we can see that at all depths,
Scheme 2 is generally the most accurate scheme, followed by
Scheme 1. Both Scheme 2 and Scheme 1 achieve higher accuracies than
LB, which means that positive transfer [16] is observed. However,
both Scheme 3$^{(t)}$ and Scheme 3$^{(s)}$ are not only worse than
Schemes 1 and 2 in most cases, but also sometimes slightly worse
than LB, which means that negative transfer [16] is observed. This
phenomenon is rather important. It shows empirically that using
greedy layer-wise pre-training to initialize the deep network is
valuable. If we disturb the initial point of some layer with noise,
or if the pre-training data is inconsistent across layers, the
performance drops dramatically.
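
The reading of the bounds can be made explicit with a small helper. The function name and the outcome labels are assumptions; the accuracies are the two-layer Babble entries as read from Table 2.

```python
def transfer_outcome(accuracy, lower_bound, upper_bound):
    """Classify a scheme's accuracy relative to the LB and UB baselines."""
    if accuracy < lower_bound:
        return "negative transfer"
    if accuracy > upper_bound:
        return "above upper bound"
    return "positive transfer" if accuracy > lower_bound else "no transfer"

# Babble noise, two hidden layers, Street noise as the source corpus.
babble_two_layers = {"S1": 75.67, "S2": 76.59, "S3(t)": 75.73, "S3(s)": 73.17}
outcomes = {name: transfer_outcome(acc, lower_bound=74.09, upper_bound=78.85)
            for name, acc in babble_two_layers.items()}
```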

Table 3 lists the transfer accuracies with the Subway noise as the
source corpus. Due to the length limit, we only show the results of
two target noise corpora. The experimental phenomena in the other
noise scenarios are similar to these two. From the table, we can see
that due to the significant difference between the Subway noise and
the target corpora, the accuracies of all schemes drop significantly
from UB. However, we can also observe that the accuracies yielded by
Scheme 2 are still significantly better than LB and improve layer by
layer, which means that positive transfer is observed too.

In conclusion, 1) the proposed schemes are effective in dealing with
the mismatch problem between the source data and the target data;
2) Schemes 1 and 2 are both effective transfer learning schemes;
3) initialization via unsupervised greedy layer-wise pre-training is
valuable.

Several other interesting phenomena can be observed by comparing
Table 2 and Table 3. When the feature distributions of the source
data and target data are similar, the accuracy of the DDNN-based VAD
drops slightly with respect to the depth of the network. One
possible explanation is that the target distributions can be
sufficiently covered by the source corpus, so that we can achieve a
desired performance with just one hidden layer of DDNN. On the
contrary, when the feature distributions are dissimilar, the
accuracy increases dramatically with respect to the depth of the
network, which demonstrates the power of the transfer learning
schemes. However, when compared with the referenced methods, we
observe that when the source and target environments are relatively
similar, the DDNN-based VAD outperforms the referenced methods, but
when the environments are severely dissimilar, the DDNN-based VAD is
weaker than the referenced ones.

Table 4 lists the pre-training time comparison between the schemes.
From the table, we can see that Scheme 1 is the most efficient one,
and Scheme 3 is slightly slower than Scheme 1 when only its hybrid
pre-training is counted, since the source DDNN is trained once and
reused.

Table 2: Transfer accuracy comparison (in percentage) with the
Street noise corpus (identification = 4) as the source data. "LB" is
short for the lower bound, "S1" for Scheme 1, "S2" for Scheme 2,
"S3(t)" for Scheme 3$^{(t)}$, "S3(s)" for Scheme 3$^{(s)}$, and "UB"
for the upper bound. "# layers" means that the depth of the DDNN is
"#". Because the experimental settings are exactly the same as in
[22], we copy the results of the referenced VADs from [22]. Due to
the length limit, we only report the best performance among the
referenced VADs and the corresponding VAD algorithm. Referenced
methods marked with "*" are machine-learning-based VADs that are
trained and tested in matching environments.

                                    1 layer                     2 layers                                  3 layers
ID  Noise type  Referenced          LB     S1     S2     UB     LB     S1     S2     S3(t)  S3(s)  UB     LB     S1     S2     S3(t)  S3(s)  UB
1   Babble      75.51 (Ramirez05)   74.95  77.15  76.44  78.61  74.09  75.67  76.59  75.73  73.17  78.85  72.72  75.53  75.92  73.74  72.84  79.14
2   Car         79.25 (G.729B)      81.89  82.91  83.51  86.77  82.05  82.08  83.17  82.19  81.73  86.96  81.49  81.81  82.92  81.11  82.20  87.09
3   Restaurant  69.59 (Ramirez05)   74.44  75.14  75.74  83.47  73.84  75.34  75.61  74.53  74.17  83.90  73.25  75.59  75.19  75.76  73.31  83.78
5   Airport     72.45 (Shin)*       77.35  77.92  77.56  82.34  77.12  77.82  77.88  77.51  77.32  82.45  76.73  77.34  77.86  77.18  76.96  82.30
6   Train       75.26 (G.729B)      80.51  81.69  81.22  83.88  80.47  80.64  82.27  81.42  80.88  84.21  79.70  80.76  81.89  80.50  79.90  84.25
7   Subway      73.16 (Ramirez05)   68.19  68.19  69.87  85.84  68.35  72.69  76.28  68.22  68.17  85.62  68.44  74.49  76.42  70.70  68.26  85.73

Table 3: Transfer accuracy comparison (in percentage) with the
Subway noise corpus (identification = 7) as the source data.

                                    1 layer                     2 layers                                  3 layers
ID  Noise type  Referenced          LB     S1     S2     UB     LB     S1     S2     S3(t)  S3(s)  UB     LB     S1     S2     S3(t)  S3(s)  UB
1   Babble      75.51 (Ramirez05)   54.58  54.60  62.05  78.61  54.58  54.59  67.59  54.58  54.58  78.85  54.58  54.58  68.11  54.59  54.58  79.14
2   Car         79.25 (G.729B)      55.80  55.80  68.24  86.77  59.54  58.05  69.19  63.09  64.11  86.96  61.33  57.96  70.05  56.52  58.65  87.09

Figure 1: Hinton diagram of the feature distributions in different
noise scenarios (both axes: identification of the noise scenario).
Each grid of the Hinton diagram measures the distribution similarity
of the features in the relevant two scenarios. The bigger the grid
is, the more similar the two distributions are. The similarity is
calculated as $\exp\left(-\|c^{(s)} - c^{(t)}\|^2 / 2\right)$ [35]
with $c$ as the feature centroid.

Table 4: Pre-training time (in seconds) comparison.

Scheme        1 layer     2 layers    3 layers
S1            570.48      713.21      774.87
S2            11200.19    12552.96    12838.95
S3 (source)   -           12055.02    12592.76
S3 (hybrid)   -           1860.34     985.92


5. Conclusions 

In this paper, we have addressed the mismatch problem between the
source corpus and the target corpus from the transfer learning
perspective, and have examined three DDNN-based domain adaptation
schemes for the problem. Experimental results have shown that
Schemes 1 and 2 are effective in dealing with the mismatch problem
of VAD when compared with the traditional training method, while
Scheme 1 is much more efficient than Scheme 2. The results have also
shown that the layer-wise pre-training strategy is important for the
success of the deep-learning-based transfer learning schemes.
Although Scheme 3 fails, it represents an attempt at a compromise
between training time and accuracy, and provides a counterexample
that shows the effectiveness of the layer-wise pre-training.

Experimental results have also shown that when the source and target
corpora are very dissimilar, the performance might be weaker than
that of the referenced methods. To address this, pre-training with
more unlabeled target data, with multiple source domains, and with
more hidden layers might be helpful. Moreover, how to bring the
performance of Scheme 2 closer to the upper bound and how to
accelerate Scheme 2 while keeping its effectiveness are also
problems that we want to address. We leave these problems as future
work.